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Abstract 

RNA sequencing (RNA-Seq) is rapidly replacing microarrays for profiling gene expression with much improved 
accuracy and sensitivity. One of the most common questions in a typical gene profiling experiment is how to 
identify a set of transcripts that are differentially expressed between different experimental conditions. Some of 
the statistical methods developed for microarray data analysis can be applied to RNA-Seq data with or without 
modifications. Recently several additional methods have been developed specifically for RNA-Seq data sets. 
This review attempts to give an in-depth review of these statistical methods, with the goal of providing a 
comprehensive guide when choosing appropriate metrics for RNA-Seq statistical analyses. 



Introduction 

Transcriptomics holds the key to understanding how the 
information encoded in the genome is translated into 
cellular functions, and how this translation process 
responds to the changing environment. Given a tran- 
scriptome, or the collection of all the transcripts includ- 
ing both protein coding mRNAs and noncoding RNAs, 
one of the outstanding questions in transcriptomics is to 
accurately quantify the abundance of each transcript 
within different tissues and time points, and to correlate 
changes in abundance to genetic and environmental 
perturbation in order to understand genome function 
and adaptation. 

Transcriptome profiling, or gene expression profiling, 
is the technology used to determine the steady state 
abundance of each transcript within a transcriptome. 
Transcriptome profiling is traditionally done using either 
quantitative RT-PCR (reviewed in [1]) to interrogate a 
few genes, or microarrays (cDNA array [2] or whole 
genome tiling array [3-5]) to investigate genome-wide 
transcriptional activity. Recently, as a result of the low 
cost of next generation sequencing technologies [6], 
transcriptome profiling by RNA sequencing (RNA-Seq) 
is becoming the method of choice because of its unpre- 
cedented sensitivity and accuracy (reviewed in [7,8]). 
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Unlike prior technologies, next-generation sequencing 
technologies allow reference transcriptomes to be assembled 
directly from RNA-Seq data, thereby eliminating the 
need for existing reference genomes or transcriptomes 
[9]. This capability is particularly attractive for non- 
model organisms or microbial communities that lack 
high quality references. 

There are both shared and unique aspects in the 
experimental design and data generation phases of 
expression microarrays and RNA-Seq technologies, and 
these attributes are compared elsewhere [7,8,10,11]. 
For data analysis there are three major steps for both 
technologies: data preprocessing, statistical analysis and 
functional interpretation (Figure 1). Preprocessing micro- 
array data normally includes background correction, 
normalization and summation, while preprocessing RNA- 
Seq data includes artifact filtering and short read align- 
ment/assembly. The bioinformatics details involved in 
preprocessing microarray and RNA-Seq data have been 
reviewed previously [8,12,13]. After data preprocessing 
the expression level of each transcript is determined. For 
microarrays the levels are often represented as continuous 
numbers, while for RNA-Seq datasets expression levels 
are represented as discrete read counts. Statistical anal- 
ysis is then performed to identify differentially expressed 
transcripts among different samples/conditions, and the 
results can be further analyzed to gain functional insights 
(Figure 1). In this review we focus on the statistical meth- 
ods that are used to identify differentially expressed tran- 
scripts in RNA-Seq experiments. Some of them (for 
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Figure 1 The analysis workflows of microarray and RNA-Seq data. 



example, likelihood ratio test) have been used for micro- 
array data analysis and then were adapted for RNA-Seq 
data analysis, while others were developed specifically for 
RNA-Seq. 

Over the past decade, various statistical analysis tools 
have been developed to analyze expression profiling data 
generated by microarrays (Reviewed in [14]). Before 
these tools can be applied to RNA-Seq data, it is worth 
noting that microarray data and RNA-Seq data are in- 
herently different. As mentioned earlier, microarray data 
is "analog" since expression levels are represented as 
continuous hybridization signal intensities. In contrast, 
RNA-Seq data is "digital", representing expression levels 
as discrete counts. This inherent difference leads to the 
difference in the parametric statistical methods that are 
used since they often depend on the assumptions of 
the random mechanism that generates the data. For 
example, the normal distribution is a common distribu- 
tion for statistical comparisons involving continuous 
data. It is generally assumed that the log intensities 
(or expression levels) in a microarray experiment are 
approximately normally distributed. In contrast, the 
Poisson, Binomial and Multinomial distributions are 
more suitable for modeling discrete data in an RNA-Seq 
experiment. Therefore a statistical method developed 
for microarray data analysis cannot be directly applied 
to RNA-Seq data analysis without first examining 
the underlying distributions. Recently several statistical 
methods have been developed to deal specifically 
with RNA-Seq count data [14-17]. In this review 



we summarize these methods, while focusing on the 
pros and cons of each method in the context of spe- 
cific applications. 

RNA-Seq count data 

As mentioned previously, the expression levels measured 
by RNA-Seq experiments are represented by the number 
of reads derived from each transcript in a transcriptome. 
We will not discuss the problem of resolving the expres- 
sion levels of alternatively spliced transcript isoforms 
from a single gene, since this is still a challenge and 
undergoing active research [7,18-20]. Here, for simpli- 
city, we use the term "gene" or "transcript" to generically 
refer to a spliced mRNA isoform, a non-coding RNA, a 
small RNA, or any product resulted from a transcrip- 
tional, splicing, or post-transcription-modification event. 

RNA-Seq count data can be organized into a numer- 
ical (py.n) matrix (M), with p representing the number 
of genes and n the number of samples. We use pheno- 
type to refer to an experimental condition, treatment, 
tissue, or time point. Typically the number of genes is 
far greater than the number of samples (/?»«). Another 
dimension often present in RNA-Seq datasets is the 
number of replicates. Since the RNA-Seq protocol is 
highly reproducible [21-23], technical replicates are usu- 
ally not necessary, and instead 2-3 biological replicates 
are used to reduce the degree of noise resulting from 
biological variations. Generally we assume that there are 
biological (or technical) replicates under the if 
phenotype, k=l, t. As a result, a typical RNA-Seq 
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data set has a series of two-dimensional sub-matrices 
containing non-negative integers (counts), with each 
sub-matrix being derived from a specific phenotype. 

Therefore, we have n = Y^k=i ni " an< ^ *- ne ( ^ a ' :a ma trix 
can be arranged As M below, with the element mjj 
being the expression level of the f gene from the /* 
replicate in the k th phenotype, i = 1, 2, ■ ■•,p,j = 1, 2, • • •, 
k=l,2, ■■-,t. 

RNA-Seq count data normalization 

An important consideration to make prior to statistical 
analysis is normalization. The sequencing depth, or 
library size, which is usually defined as the total number 
of aligned sequences in each sample, often varies from 
one sample to another. Denote ij^ as the sequencing 
depth for the /' sample in the tf h phenotype. Then 

we have Vj = Y^i>i m ij • Normalizing count data trans- 
forms it from discrete to continuous. For example, the 
RPKM metric (Reads Per Kilobase of transcript per Mil- 
lion mapped reads [22]) is used to measure the relative 
expression level of a transcript. Although RPKM consid- 
ers the length of the transcript, and thus allows for 
comparison among different transcripts, in most studies 
the gene length is not an issue because the comparison 
is made for the same gene between different conditions 
[24]. RPKM-based expression measurements cannot be 
directly used for the count-based models. 

In this review we assume that the data in the matrix 
M are raw read counts without normalization. Unlike 
microarray data analysis which often requires sophisti- 
cated normalization procedures to compensate for biases 
introduced from sample loading, imaging, and other 
technical or biological factors [25], RNA-Seq data typic- 
ally does not require a separate normalization step for 
two reasons: 1) The difference in the sequencing depths 
or library sizes between different samples is addressed 
through the parameterization of the underlying distribu- 
tions (see below). 2) Some models already take into ac- 
count the variation among biological replicates. For 
example, the over-dispersion parameter (discussed 
below) in the Negative Binomial model accounts for the 
variation across the biological replicates [26]. 

In the next section we will discuss the statistical meth- 
ods that have been developed to address whether or not 
a gene is differentially expressed among a group of time 



points or conditions. In the case of t = 2, this problem 
reduces to the two-group (pair-wise) comparison. 

Statistical methods to detect differentially 
expressed genes 

Several statistical methods have been proposed to detect 
the differentially expressed genes from a counts table 
(Table 1). The number of samples or replicates in a typ- 
ical RNA-Seq experiment is usually small, thereby 
excluding the application of nonparametric methods that 
implement sample permutations. For this reason, here 
we focus on parametric methods only. These methods 
differ in their underlying data distributions, how they 
handle biological replicates, and their ability to perform 
multi-group comparisons. Some of these methods have 
been implemented in related R/Bioconductor packages 
(Table 1). Each of these methods is discussed in further 
detail below. 

Methods based on the poisson distribution 

In an RNA-Seq dataset, the expression level of a specific 
gene, m\f\ is defined as the total number of short 
sequences which aligned to the gene. That is, it is the 
sum of a series of random events. Each event corre- 
sponds to a short sequence and follows a Bernoulli dis- 
tribution with the probability of success equating to 
the probability that the sequence aligns to the gene. 
Since the read alignments can be assumed to be inde- 
pendent, the distribution of m\f can be approximated by 
a Poisson distribution, Poii^), with being the 
mean. This Poisson model is verified in the case where 
there are only technical replicates using a single source 
of RNA [21]. For the i th gene, the statistical null hypoth- 
esis in testing different expression levels across pheno- 
types is that all of the means are equal. The statistical 
test procedures based upon Poisson modeling are 
reviewed in the next subsection. 

Fisher's exact test 

This method can be used for comparing two phenotypes 
(t = 2). For the i th gene, we can form a 2x2 contingency 
table for its expression values in the matrix M [27], 
Table 2. 

The Fisher's exact test is to test whether or not there 
exists a significant association between the gene expres- 
sion and the phenotype, in other words, whether or not 
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Table 1 A comparison of common statistical methods for RNA-Seq differential gene expression analysis 



Method 


Underlying 

H ictrihi itinn 

UI3LI IUUIIUII 


Recommended with 

hiolnniral rpnliratps: 
u i u i <j y IV. en i c ui \\- a ics 


Multi-group 

f e\ m na r i c: rt n 

LUI 1 1 yJa 1 IjUI 1 


R/Bioconductor 

oar kan £} 


Reference 


Fisher's exact Test 


Poisson 


No 


No 


No 


[27] 


Likelihood ratio test 


Poisson 


No 


Yes 


No 


[21] 


edgeR 


Negative Binomial 


Yes 


Yes 


Yes 


[28] 


DESeq 


Negative Binomial 


Yes 


No 


Yes 


[15] 


baySeq 


Negative Binomial 


Yes 


Yes 


Yes 


[17] 


BBSeq 


Beta-Binomia 


Yes 


No 


Yes 


[29] 


Two-stage poisson model 


Poisson 


Yes 


Yes 


No 


[30] 



the odds ratio is significantly greater or less than 1. This 
test is based on the fact that with the assumption of 
Poisson sampling and fixed marginal totals, the count 
ra'i follows a hyper-geometric distribution. The p value 
is the total of the hyper-geometric probabilities for out- 
comes at least as favorable to the alternative hypothesis 
(the gene expression in phenotype 1 is lower than that 
in phenotype 2 (the odds ratio is < 1), or vice versa) as 
the observed outcome [31]. A simple R function, 

t ■ I /(l) (2) ,(1) (1) ,(2) (2)\ 

> x = matrix I cm \ a , m a ,L\ — m a ,L\ — m a J , 
mow = 2) 



> fisher.test(x, alternative = ci two. sided , greater , less 



will give the p value of the test. 

Since the null hypothesis of independence in Fisher's 
exact test is equivalent to the null hypothesis that the 
odds ratio is equal to 1, one can avoid a potential false 
positive due to the difference in the sequencing depths. 
As an example, consider the case: rn^a = 10, ml? = 20, 
Li 1 =le + 6,Li = 2e + 6. The estimated odds ratio is 1 
(no association by Fisher's exact test), but the fold 
change is 2 (possibly declared as differential by other 
tests based on fold change). Furthermore, though Fish- 
er's exact test was designed for analyzing datasets with- 
out replicates, if there are replicates and Poisson 
sampling holds true, the test can still be applied - one 
simply sums up the replicates under the same condition 
to form the 2x2 contingency table. 

The p value obtained above is for a single gene. As in 
the analysis of expression data from microarray experi- 
ments, there are thousands of genes in one RNA-Seq 
experiment and thus we need to consider the problem 
of an inflated false positive rate due to multiple 



Table 2 The 2x2 contingency table for one gene 
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hypothesis testing. This problem can be addressed by dir- 
ecdy adjusting p values or calculating q values [32,33]. 
Many methods have been proposed to calculate adjusted p- 
values, including Bonferroni's single-step adjusted p- 
values, Holm's step-down adjusted p values [34], 
Hochberg's step-up adjusted p-values [35], and many more. 
The R function mt.raw2adjp() in the R/Bioconductor pack- 
age multtest computes adjusted p values, with nine dif- 
ferent computing procedures. Q values can be obtained 
by the R function qvalue() in the R/Bioconductor pack- 
age qvalue. All of these functions take the vector of raw 
p values as the input argument. 

Likelihood ratio test 

Marioni et al. assumed that the gene count mf > follows a 
Poisson distribution, PoUjJf* = L^vf*), where vf' represents 
the proportion of gene transcript copies of the ith gene in 
all samples under the kth phenotype (k = 1, 2, for pair-wise 
comparison), and then used the likelihood ratio test to 
identify the differentially expressed genes [21]. The purpose 
of incorporating the sequencing depth {Lj) parameter into 
the Poisson mean is to reduce the variation in sequencing 
depth. If we look into the significance of differential expres- 
sions of genes on a gene-by-gene basis, the likelihood func- 
tion and maximum likelihood estimations are easy to 
obtain. For the two-sided alternative hypothesis, by apply- 
ing simple algebraic operations we have the likelihood ratio 
test statistic, 



^ = - 2 ((E' 
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E ; .if + E ; .ij 



E ; 4> E^f + E^f, 
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And the p value for an individual gene can be calcu- 
lated as the right tailed probability of a Chi-squared dis- 
tribution with 1 degree of freedom. For the one-sided 
alternative hypothesis, v,- 1 ' > vf' the p value is half 
of the above right-tailed probability of the Chi-squared 
distribution if the unconstrained maximum likelihood 

estimates of vP.vf satisfy: [vf ] —j > 

p) ( = v) ); or 0.5 otherwise [36]. Adjusted p values 

or q values to control the false positive rate can be 
obtained by the methods described in the previous 
subsection. 

Poisson modeling is an appropriate fit not only for se- 
quencing data with technical replicates [21], but also for 
those with biological replicates, as long as the sample 
mean is close to the sample variance. However, the 
requirement that the variance is the same as the mean 
excludes the application of the Poisson model to RNA- 
Seq data, should over-dispersion (defined below) occur. 
The likelihood ratio test may give misleading results 
if the assumptions about the sampling distribution 
are violated. 

Models for over-dispersed count data 

Given a sampling distribution, over-dispersion occurs if 
the observed variance is greater than the assumed vari- 
ance. In the Poisson model, over-dispersion occurs if the 
sample variance is greater than the sample mean. There 
could be several sources that cause over-dispersion in 
RNA-Seq data, including the variability in biological 
replicates due to heterogeneity within a population of 
cells, possible correlation between gene expressions due 
to regulation, and other uncontrolled variations. The 
existence of over-dispersion in real data was observed 
in several previous studies [26,30]. Popular models to 
safeguard against over-dispersion include the negative 
binomial distribution, beta-binomial distribution or two- 
stage Poisson distribution, as discussed below. 

Negative binomial model 

As mentioned above, when over-dispersion is observed 
across the samples, the gene counts cannot be estimated 
accurately by a simple Poisson model. One way to handle 
this problem is to apply a Bayesian method - allowing the 
Poisson mean to be a random variable and then model the 
gene counts by the marginal distribution of m$. Specific- 
ally, assume that the Poisson mean follows a Gamma distri- 
bution with the scale parameter /Jf ) 4 > the shape 
parameter ( x / 4,), then the marginal distribution of the gene 
count has a Negative Binomial distribution with mean (4 
and the variance ^4 (1 + (Jf'cp) [37]. The Negative Binomial 
distribution can model the over-dispersed Poisson gene 



count where <p > 0 and reduces to the Poisson distribution 
as <p — > 0. The R/Bioconductor package "edgeR" applies this 
model to detect the differentially expressed genes in RNA- 
Seq data [28], where the mean of the Negative Binomial is 
rewritten as (f4 = Lj 'vj ) to adjust for the difference in se- 
quencing depths across the samples. The ith gene is differ- 
entially expressed if the parameters vp are significantly 
different across phenotypes. For simple, pair-wise compari- 
sons between phenotypes, the Negative Binomial para- 
meters are estimated by conditional maximum likelihood 
and quantile-adjusted conditional maximum likelihood 
[26,37], and then an exact test (similar to Fisher's exact 
test) is carried out to generate p values for individual 
genes. These can be completed by using the R function 
"exactTestO" with options for different estimates of the 
dispersion. For complex, mutligroup comparisons among 
phenotypes, edgeR applies the Cox-Reid profile-adjusted 
likelihood method to estimate the Negative Binomial 
parameters [38], and then uses the generalized linear 
model likelihood ratio test (R functions glmFit() and 
glmLRT()) to discover differentially expressed genes. 

While the relationship between the Negative Binomial 
mean and variance simplifies the estimation of these 
parameters, it may result in some variation unexplained 
by the model and thus potentially introduce selection 
biases in the differentially expressed genes. Anders 
and Huber extended the method in edgeR by hierarch- 
ically modeling this mean-variance relationship [15]. 
Their method is implemented in a R/Bioconductor 
package, called DESeq. The R function nbinomTest() or 
nbinomTestForMatrics() can return unadjusted p values 
for individual genes. The adjusted p values from the 
"BH" method [39] are also given by the first function. 
To our knowledge, DESeq is currently limited to pair- 
wise comparisons. 

Another modification to edgeR is given by Hardcasde 
and Kelly [17]. Their method is based on an empirically 
Bayesian approach and can be used for multi-group com- 
parisons as well as for pair-wise comparisons. The gene 
count is also modeled by the Negative Binomial distribu- 
tion. For each gene, instead of calculating a p value, a pos- 
terior probability is obtained for each comparison 
(alternative) among phenotypes. The probability of differen- 
tial expression of a gene is defined as the sum of the poster- 
ior probabilities for all possible comparisons. Then, the 
genes are ranked based upon the probability of differential 
expression This method is implemented in the R/Biocon- 
ductor package, baySeq. The R function getPosteriors() 
returns the posterior likelihoods of comparisons. One of 
the disadvantages of this method is that it is more compu- 
tationally expensive since all types of comparisons (alterna- 
tives) are considered and the number of comparisons 
increases dramatically when the number of phenotypes 
increases. 
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Beta-binomial model 

Zhou, Xia and Wright model the gene count in a sample 
with a Beta-Binomial distribution [29]. Assuming that 
whether or not a short sequence is mapped to a particular 
transcript follows a Bernoulli law with a mapping prob- 
ability 6, then 8 has a prior of the Beta distribution. The 
introduction of randomness to the mapping probability is 
to account for the over-dispersion in the gene count data, 
with the over-dispersion being explained by a Beta distri- 
bution parameter cf>. The maximum likelihood estimates of 
parameters (E(6), <J>) are obtained either by a free model in 
which both parameters are unrelated or by a constrained 
model in which a mean-overdispersion relationship is 
assumed. A logistic model is fitted to model the relation- 
ship between the mean E(6) and the design matrix of 
covariates including the indicator variables for pheno- 
types. Then the significance of an indicator variable is 
determined by a Wald statistic (the ratio of the corre- 
sponding coefficient in the fitted logistic model and its 
standard error) and indicates whether or not the gene is 
differentially expressed. This method is implemented 
in the R package, BBSeq, which is available on this web- 
site (last accessed on 03/07/2012): http://www.bios.unc. 
edu/research/genomic_software/BBSeq/. Two R functions, 
free.estimateO and constrained.estimateQ, can generate the 
raw p values for genes in pair-wise comparisons. However, 
no function in the package directly gives the p values for 
multi-group comparisons. 

The Beta-Binomial model has been widely used to 
model the over-dispersed, discrete count data. For 
example, it was applied to analyze tag count data derived 
from the Serial Analysis of Gene Expression (SAGE) 
[40]; tag count data obtained from Digital Gene Expres- 
sion profiling [41]; and spectral count data generated 
from label-free tandem mass spectrometry-based 



proteomic experiments [42]. To model RNA-Seq data 
with a Beta-Binomial distribution, the probability that a 
short sequence is mapped to a specific transcript is im- 
plicitly assumed to be constant for all short sequences in 
a sample. We have not seen any verification or justifica- 
tion of this assumption in the literature. 

Two-stage poisson model 

Auer and Doerge [30] proposed a Two-Stage Poisson 
Model for RNA-Seq data analysis, based upon the argu- 
ment that the over-dispersion is most likely caused by 
within group variation in expression if the experiment 
includes independent biological replicates without a 
significant population structure. The method consists of 
two steps. In the first step, genes are classified into two 
exclusive classes, genes with or without over-dispersion, 
by using an adjusted score test on the hypothesis of 
whether or not the over-dispersion is present within the 
count data of a gene. Then in the second step, for genes 
without significant over-dispersion, differential expres- 
sion is tested by a standard likelihood approach with 
maximum likelihood estimates being obtained under the 
Poisson model. Raw p values are calculated by an 
approximated Chi-squared distribution of degree one. 
For genes with significant over-dispersion, they use the 
quasi-likelihood statistic that is defined as the ratio of the 
likelihood statistic and an estimate of over-dispersion. 
Raw p values are calculated from an F-distribution. The 
built-in R functions "glmQ" and " devianceQ" can be used 
to obtain the likelihood ratio statistics. The detailed R code 
for p values can be downloaded from the authors web- 
site (http://www.stat.purdue.edu/~doerge/software/TSPM.R), 
last accessed on 03/07/2012. 

Under the model assumptions, the authors demon- 
strated that the Two-Stage Poisson Model is a powerful 
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tool for detecting differentially expressed genes. How- 
ever, if other confounding factors exist such that the 
levels of transcription within a phenotype are not stable, 
this method may not control the false positive rate well. 
Furthermore, as pointed out by the authors, this method 
requires a relatively higher degree of freedom (the differ- 
ence between the sample size and the number of pheno- 
types) in order to be effective. 

Conclusion and future perspectives 

Next-generation sequencing technologies are revolu- 
tionizing genomic/proteomic studies, providing high- 
throughput datasets with unprecedented precision and 
accuracy. The technology for profiling gene expressions 
and composition (RNA-Seq) has been widely applied in 
biological/medical research. Appropriate and powerful 
statistical analysis using RNA-Seq data is essential to 
the research. 

Generating an accurate list of differentially expressed 
genes is the basis for pathway or gene set enrichment 
analysis. A gene set with a large number of false posi- 
tives will compromise these analyses. The parametric 
approaches discussed here are preferable to nonpara- 
metric ones in order to increase the power of detection. 
However, the false positive rate may be dramatically 
increased if the assumptions of the model distribution 
are violated. In Figure 2, we demonstrate the difference 
between the histogram of 1000 data points simulated 
from an underlying distribution and the probability mass 
of an incorrectly fitted model (red curves). The data in 
the left panel are generated from a Negative Binomial 
distribution with mean 10 and over-dispersion 0.5 using 
the R function rnegbinQ, but are modeled by a Poisson 
distribution with the probability mass: 

/>(*)= -^j-, k= 0,1,2, , 

where the mean A = 10. Those in the right panel are gen- 
erated from a Poisson distribution with mean 10 using 
the R function rpoisQ, but are modeled by a Negative Bi- 
nomial distribution with the probability mass: 

rjk+r 1 ) ( i vV fi y 

p[q r(k+i)r{r 1 )V + ^) V + r 1 ) ' 

k = 0,1,2, , 

where r represents the gamma function and the mean 
^ = 10, the over-dispersion parameter 0=0.5. The values 
for the small circles in the fitted models are calculated 
as the product of 1000 and p(k). The differences in both 
cases are not negligible, indicating the seriousness of 
wrong assumptions about the model. 



To our knowledge, there is no paper in the literature 
which studies the efficacy of these methods when the 
model assumptions do not hold. Given the limitation of 
small sample sizes in RNA-Seq experiments, robust test 
procedures which safeguard against the departure of 
model assumptions are necessary. 

Most of the proposed methods produce raw p values 
for genes based upon the approximate/asymptotic 
null distribution. This approximation performs well for 
highly expressed genes but performs poorly for lowly 
expressed genes. This may create bias during the selec- 
tion of differentially expressed genes. Some authors 
simply filter out lowly expressed genes. This is very sub- 
jective, and some important genes, which are expressed 
in one condition but not in another, may be excluded 
from the result. New testing approaches, which are 
powerful and effective for both highly and lowly 
expressed genes, are still needed. 
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